White Wine Exploration by Arne Johan Dahl

Univariate Plots Section

Some basic information about the data:

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The dataset contains 4898 observations of 12 features. Quality is an integer; all other features are numeric (floating-point) values. The mean quality is 5.878 and the median quality is 6. The highest quality is 9 and the lowest is 3.

Fixed acidity ranges from 3.8 to 14.2, with a median value of 6.8. About 75% of the wines have a volatile acidity of at most 0.32 and a citric acid content of at most 0.39. For both free and total sulfur dioxide, there is a large gap between the minimum and maximum values (2.0 vs. 289.0 for free sulfur dioxide and 9.0 vs. 440.0 for total sulfur dioxide). Density varies less, with a minimum of 0.9871, a maximum of 1.0390 and a median of 0.9937. pH values range from 2.72 to 3.82. Alcohol content varies between 8.0 and 14.2, with a median of 10.4.

Distribution of quality ranks

First, I look at the distribution of quality ranks.

## [1] 4898
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

There are 4898 wines in the data set. Tabulating the values shows that each tail is thinly populated: only 20 observations have the lowest quality (3) and only 5 have the highest quality (9). By far the most common judged quality is 6, with 2198 of the 4898 observations. A histogram of the distribution suggests that quality is roughly normally distributed.
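The summary figures quoted for quality can be recomputed from the frequency table alone. The following is a minimal sketch in Python rather than the R used for the original analysis; the variable and function names are illustrative assumptions, and only the counts are taken from the table above.

```python
# Quality counts copied from the frequency table above.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

n_wines = sum(quality_counts.values())

# Weighted mean of the quality scores.
mean_quality = sum(score * n for score, n in quality_counts.items()) / n_wines


def value_at(counts, position):
    """Return the score found at a 0-based position in the sorted data."""
    seen = 0
    for score in sorted(counts):
        seen += counts[score]
        if position < seen:
            return score
    raise IndexError("position beyond the data")


# With an even number of wines the median averages the two middle values.
median_quality = (
    value_at(quality_counts, (n_wines - 1) // 2)
    + value_at(quality_counts, n_wines // 2)
) / 2

share_of_sixes = quality_counts[6] / n_wines

print(n_wines, round(mean_quality, 3), median_quality, round(share_of_sixes, 3))
```

The share of wines rated 6 comes out at roughly 45%, which is what makes the tails so thin by comparison.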


Distribution of the other variables

Next, I get a sense of the distribution of the other variables.

It seems like many of the variables are somewhat normally distributed (although the binwidths are not adjusted).

A closer look on the different variables

The data set includes a description of each variable, and I decide to take a closer look at the distribution and relevant statistics for each variable. I also try to adjust binwidths very roughly in the different plots by looking at the scale on the x axis.

Fixed acidity

Fixed acidity is the amount of tartaric acid in the wine, measured in grams per litre (g / dm^3). Tartaric acid is one of the main acids found in wine, and is the source of “wine diamonds”, the small potassium bitartrate crystals that sometimes form spontaneously on the cork or at the bottom of the bottle. Chemically, tartaric acid lowers the pH during fermentation to a level where spoilage bacteria cannot live.

As can be seen, fixed acidity appears to be more or less normally distributed. Summary table containing some main statistics for fixed acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Volatile acidity

Volatile acidity is the amount of acetic acid in the wine, which at too high levels can lead to an unpleasant, vinegar taste. It is measured in g / dm^3 of acetic acid. Volatile acidity is roughly normally distributed, but has a much longer right tail, indicating outliers with a higher level of volatile acidity.

Summary table containing some main statistics for volatile acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Citric acid

Citric acid is measured in g / dm^3. According to the description in the dataset documentation, it is usually found in small quantities and can add ‘freshness’ and flavor to wines. Citric acid appears to be roughly normally distributed. There do, however, seem to be “spikes” at both 0.5 and 0.75 g / dm^3, maybe due to rounding of the reported values?

Summary table containing some main statistics for citric acid:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Residual sugar

Residual sugar is the amount of sugar remaining after fermentation stops. It is measured in g / dm^3. Wines rarely contain less than 1 g of residual sugar per litre. If a wine contains over 45 grams per litre, it is typically considered sweet.

Summary table containing some main statistics for residual sugar:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

After adjusting the binwidth, I’m intrigued by the “residual sugar” distribution, as it does not seem to be normally distributed.

Looking more closely at the distribution of residual sugar, including outliers

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual sugar content differs considerably between wines. The mean residual sugar value is 6.391 and the median is 5.2. The minimum value is as low as 0.6, while the maximum is 65.8. The 1st quartile value is 1.7, while the 3rd quartile value is 9.9. Plotting the distribution reveals that most wines have a residual sugar content below 20, with a spike between 1 and 2.

There appear to be some outliers to the far right in the plot, so I make a new plot where I zoom in to get a closer look. It reveals that only five wines have a residual sugar value above 25, and only three have a value over 30 (two with 31.60 and one with 65.80).

I further table the wines with residual sugar values over 25:

Residual sugar values of wines with residual sugar > 25:

## [1] 31.60 31.60 65.80 26.05 26.05

Quality values of wines with residual sugar > 25:

## [1] 6 6 6 6 6

All of these wines have a judged quality of 6, which is the most common quality level, so they don’t stand out quality-wise. Recalling that wines with a residual sugar value over 45 g per litre are considered sweet, it seems that the Vinho Verde wines are not particularly sweet, with only one wine in the dataset being “sweet”.
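The counts mentioned in this section can be checked directly against the five values printed above. This is only an illustrative sketch in Python (not the R code behind the report); the list is copied from the output, and the names are assumptions made for the example.

```python
# Residual sugar values (g / dm^3) above 25, copied from the output above.
high_sugar = [31.60, 31.60, 65.80, 26.05, 26.05]

# Threshold above which a wine is typically considered sweet (from the text).
SWEET_THRESHOLD = 45

over_30 = [value for value in high_sugar if value > 30]
sweet = [value for value in high_sugar if value > SWEET_THRESHOLD]

print(len(high_sugar), len(over_30), sweet)
```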

Chlorides

Chlorides represent the amount of salt in the wine, measured in grams per litre (g / dm^3). The distribution appears to be bimodal, with a long right tail.

Summary table containing some main statistics for chlorides:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The middle half of the wines (between the 1st and 3rd quartiles) have a salt content between 0.036 and 0.05 grams per litre.

Free sulfur dioxide

Free form SO2 is measured in milligrams (mg) per liter. SO2 is used as an additive, as it prevents microbial growth and the oxidation of wine. The distribution of free sulfur dioxide appears to be normal, with some outliers to the right.

Summary table containing some main statistics for free sulfur dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Total sulfur dioxide

Total sulfur dioxide is the total of the free and bound forms of SO2 and, like free sulfur dioxide, is measured in mg per litre. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. Total sulfur dioxide is roughly normally distributed.

Summary table containing some main statistics for total sulfur dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Density

Density is measured in grams per cubic centimetre (g / cm^3). The density of the wine depends on alcohol content and residual sugar content. It is (very roughly) normally distributed.

Summary table containing some main statistics for density:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

pH values

pH describes how acidic or basic a substance is on a scale from 0 (very acidic) to 14 (very basic). Most wines have a pH value between 3 and 4. In this data set, pH is roughly normally distributed.

Summary table containing some main statistics for pH values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

As can be seen from the summary table, most wines in this data set also have a pH value between 3 and 4, although the minimum is 2.72, slightly below that range.

Sulphates

Sulphates (potassium sulphate) are a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. They are measured in grams per litre. The distribution is roughly normal, with a longer right tail.

Summary table containing some main statistics for sulphate:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Alcohol

Next, I look more closely at the distribution of alcohol content. The minimum value is 8, and the maximum value is 14.2. The median and mean values are 10.4 and 10.51 respectively.

Summary table containing some main statistics for alcohol:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

I’m intrigued by the “spikes” in the distribution of alcohol level, and add breaks to the x axis to see where they occur.

I wonder if the small “spikes” in the distribution coincide with round numbers (possibly caused by manufacturers reporting rounded rather than exact values). I create a modulo function to calculate the ending decimal of each value in order to get an impression of whether or not numbers are rounded. With this function I create a new variable, the ending decimal of the alcohol content of each wine. I then create a histogram to plot the distribution of the ending decimals.

In this data set, alcohol values seem to be stated in increments of 0.1. As shown in the plot above, I would argue that there is a higher frequency of wines with an alcohol content corresponding to “round numbers”, with an ending decimal of 0 or 5. There are for instance 650 wines with a stated alcohol content ending in .0, compared to 410 ending in .1 and 387 ending in .9. My guess is that some producers round the stated alcohol value to a round number.
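The ending-decimal idea can be sketched as follows. This is a Python illustration and not the R function used for the report; the function name `ending_decimal` is hypothetical, and it assumes that alcohol values are stated in steps of 0.1, so a value with more decimals would simply be rounded to the nearest tenth first. The sample values are the first ten alcohol readings shown in the structure output at the top of the report.

```python
from collections import Counter


def ending_decimal(alcohol):
    """Return the first decimal digit of a value stated in 0.1 steps."""
    # Scale to tenths and round before taking the remainder, so that
    # floating-point noise from the multiplication cannot shift the digit.
    return round(alcohol * 10) % 10


# First ten alcohol values from the structure output (not the full data set).
sample_alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11]

digit_counts = Counter(ending_decimal(value) for value in sample_alcohol)
print(sorted(digit_counts.items()))
```

Applied to all 4898 wines, a tally like `digit_counts` is what the histogram of ending decimals visualises.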

In this case, such possible inaccuracies may not be of much significance, but each one has the potential to slightly affect all other analysis done on the data set (for example fitting linear models).

I will look more closely into the relation between alcohol content and other variables in the bivariate and multivariate analysis.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4898 observations with 12 features. Quality is an integer value, but apart from that all other features are numeric (float) values.

The mean quality is 5.878 and the median quality is 6. Although the quality scale runs from 0 to 10, the highest quality in the data set is 9 and the lowest is 3.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest in this dataset.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I’m open-minded as to which other features will support my investigation into quality. I have no particular knowledge of wine chemistry and, at the beginning of the investigation, I do not have any intuition as to which variables correlate with higher quality rankings.

Did you create any new variables from existing variables in the dataset?

I created a variable called alcohol.ending.decimal, which is the ending decimal of the stated alcohol content (alcohol content is stated in increments of 0.1 %). I used the variable to plot the distribution of ending decimals to see if there was a higher occurrence of “rounded” numbers with regard to alcohol content, which I believe there is. Since I did not plan on using the variable any further, I dropped it from the data set after conducting the analysis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As mentioned above in connection with the univariate plots, most of the variables seemed to have roughly normal distributions, albeit with very long tails. Residual sugar and alcohol did not seem to be normally distributed.

I have not yet performed any substantial operations to tidy or rearrange the data. The data seems relatively tidy, with each variable as a column and each observation as a row. However, since RStudio already numbers each observation (row), I removed the X column.

Bivariate Plots Section

Scatter plot matrix

I started my bivariate analysis by using ggpairs to get an overview of how the different variables relate to each other.

From the plot, I note that alcohol and density seem to have some degree of correlation with the other variables in the data set, but that other than that there does not seem to be much correlation between the variables.

The main feature of interest is how the features of the data set relate to quality. I am therefore particularly interested in identifying features that are related to quality, and use this as a starting point for my analysis.

Boxplots of the different variables by quality level

I decided to look at the relation between the different features and quality with boxplots, since they give an indication of the distribution of the variables at each quality level. I therefore plot boxplots of all variables against quality by using grid.arrange:


The relation between alcohol and quality

From running ggpairs to produce a scatter matrix, I recall that alcohol did have the highest correlation with quality (0.4355747). I want to look further into the relation between alcohol and quality. The below plot shows alcohol level by quality.

## [1] 0.4355747

There seems to be a tendency for lower quality wines to have lower alcohol content and better quality wines to have higher alcohol content. That being said, there seems to be quite a bit of variance - for example, the lower quality wines seem to vary considerably with regard to alcohol content.


The relation between density and quality

Density also looks promising with regard to correlation with quality (-0.3071233) and merits a closer look:

## [1] -0.3071233


Further analysis of wine SO2 level - the relation between total sulfur dioxide and free sulfur dioxide

Total.sulfur.dioxide and free.sulfur.dioxide seem to have some degree of correlation (0.615501), and I want to examine this in closer detail.

## [1] 0.615501

Sulfur dioxide (SO2) protects wine from oxidation and bacteria. However, too much of it can impact taste.

From this research I understand that free and total sulfur dioxide levels are related. This leaves me curious as to whether the proportion of free to total sulfur dioxide has an impact on quality.

I decide to create a new variable free.sulfur.dioxide.proportion which is free.sulfur.dioxide/total.sulfur.dioxide, and plot the density distributions by quality:

## [1] -0.1747372
## [1] 0.008158067
## [1] 0.1972141

The proportion of free.sulfur.dioxide to total.sulfur.dioxide does indeed correlate slightly more strongly with quality (0.1972141) than either total.sulfur.dioxide (-0.1747372) or free.sulfur.dioxide (0.008158067) on its own, but with a correlation of only 0.1972141 it is not a strong predictor of quality.
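The derived variable and the correlation measure can be sketched as follows. This is a Python illustration under stated assumptions, not the R code behind the report: the `pearson` helper is written out by hand for the example, and the data are only the first ten rows shown in the structure output. Those ten wines all have quality 6, so the sample cannot reproduce the correlations with quality quoted above, and the sample correlation between free and total sulfur dioxide will not match the full-data figure of 0.615501.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# First ten rows from the structure output (mg per litre).
free_so2 = [45, 14, 30, 47, 47, 30, 30, 45, 14, 28]
total_so2 = [170, 132, 97, 186, 186, 97, 136, 170, 132, 129]

# New variable: share of the total sulfur dioxide that is in free form.
free_proportion = [free / total for free, total in zip(free_so2, total_so2)]

sample_r = pearson(free_so2, total_so2)
print([round(p, 3) for p in free_proportion], round(sample_r, 3))
```

Because free sulfur dioxide is part of the total, every proportion lies between 0 and 1.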

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As stated in the univariate plots section, I started my analysis of which variables were important for wine quality with an open mind. I therefore decided to plot all the variables in boxplots with quality on the x axis. For several of the features, there seems to be a polynomial/quadratic relation between quality and the feature. This is the case with alcohol, for example, where the highest and lowest quality wines have higher alcohol content, and the medium-low quality wines have lower alcohol content on average.

There also seem to be relations between alcohol and some other variables. Most importantly, there seems to be a relation between alcohol and quality (the feature of interest). The correlation is 0.4355747.

## [1] 0.4355747

This trend can also be shown in a density plot:


Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol content seems to be related to other features, such as density (correlation of -0.7801376) and total.sulfur.dioxide (correlation of -0.4488921).

What was the strongest relationship you found?

The strongest relationship I found was the relationship between density and residual sugar. The correlation between these two variables is 0.8389665.

Multivariate Plots Section

Density by alcohol and quality

Recalling that alcohol and density seemed to have a degree of correlation, I want to see how this relates to quality by adding color for quality:

From the above plot it seems that wines of higher quality are typically higher in alcohol and lower in density. Even though I have used a lower alpha and jitter, it might be argued that the plot is still overplotted. I therefore decide to facet the plot by quality level:

For instance, wines of quality 4 and 5 seem more likely to be clustered in the upper left corner of the plot, while wines of quality 7, 8 and 9 are more likely to be clustered in the lower right part of the plot (indicating higher alcohol content and lower density). That being said, while I would argue that there is such a general tendency, there is also a great deal of variance for each quality level. So while a lower quality wine is more likely to be in the upper left part of the plot, this does not mean that all low quality wines will necessarily be in that part of the plot.


Looking at the relation between density and residual sugar by quality

There also appears to be some correlation between density and residual.sugar level, and I want to see how this relates to quality:

It appears that higher quality wines in general tend to have lower density and less residual sugar. In this plot, too, there is a danger of overplotting, so I decide to take a closer look at the distribution for each quality level by faceting on quality. I add horizontal and vertical lines at the median values of density and residual sugar respectively, to make it easier to see how the distribution is placed relative to the median of each variable:


Relation between total sulfur dioxide and density

The plot below shows the relation between total.sulfur.dioxide and density. From running ggpairs, I know they have one of the strongest correlations between the variables. By adding color for quality, I want to see if there is some relation to quality:

From the plot, it appears to me that higher quality wines have lower density and lower total.sulfur.dioxide. I decide to facet the plot over different quality levels to see if the distribution differs for each quality level:


The relation between free and total SO2 levels by quality

In the bivariate plots section, I looked at the relation between free and total SO2 levels. I want to have a look at this relation again, now adding color indicating quality level:

## [1] 0.615501

It appears that higher quality wines have less total.sulfur.dioxide and more free.sulfur.dioxide.

Since alcohol is the feature which in itself has the strongest relation with quality, I want to investigate the relation between alcohol and the free sulfur dioxide proportion and their relation to quality by adding color for quality:

Again, I facet by quality, since the above plot may be overplotted:

From the above plot, it appears that higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the multivariate analysis, I looked at the relation between alcohol and density, which seem to strengthen each other in terms of looking at quality. Higher quality wines are typically higher in alcohol and lower in density. The features density and residual sugar also seemed to strengthen each other in terms of looking at quality, with higher quality wines tending to have lower density and less residual sugar. This is also true for the relation between alcohol and the proportion of free sulfur dioxide. Higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide.

Were there any interesting or surprising interactions between features?

I found it interesting that the distribution of pH values seemed to be so different across different quality levels. Given the low number of observations at the extreme ends of the quality spectrum, however, it is hard to say whether this is a result of a genuine difference between high and low quality wines, or whether it is particular just to this sample of white wines.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

N/A


Final Plots and Summary

Plot One

Description One

This plot shows the boxplot distributions of alcohol content for each quality level. For the three lowest quality levels (3-5), alcohol content seems to decrease with quality, i.e. the worst quality wines have a higher alcohol content than the better ones. For the higher quality wines, however, the opposite is true - from quality level 5 to 9, the alcohol level generally increases with quality. There therefore seems to be a polynomial/quadratic relation between quality and alcohol level.

The correlation value between alcohol and quality:

## [1] 0.4355747

Alcohol is the variable most strongly related to quality, with a correlation of 0.4355747. While this does not imply a very strong correlation, I would argue it is significant, and that alcohol level impacts the likelihood of a wine being deemed to be of good quality.
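One hedged way to put these correlation values in perspective is to square them: r squared is the share of the variance in quality that a straight-line fit on a single feature would account for. The sketch below is in Python (not the R used for the report), uses only correlations quoted earlier in this report, and says nothing about the curved relation described above, which a linear measure cannot capture.

```python
# Correlations with quality, copied from the report.
correlations = {"alcohol": 0.4355747, "density": -0.3071233}

# Squared correlation: variance share explained by a simple linear fit.
variance_share = {name: r ** 2 for name, r in correlations.items()}

for name, share in variance_share.items():
    print(f"{name}: r = {correlations[name]:+.3f}, r^2 = {share:.3f}")
```

On this reading, alcohol alone would account for a little under a fifth of the variation in quality.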

Plot Two

Description Two

This plot shows the distribution of pH values across the groups of wines with the same quality. The distribution seems to vary between the different quality levels. I have added a vertical line with x intercept at the mean pH value for all white wines to make comparisons across the different quality levels easier (since the mean is 3.188267 and the median is 3.18, I did not feel the need to add both).

The distribution of pH values seems to vary with the quality level. The lowest quality wines are distributed more evenly across different pH values, ranging from about 2.7 to 3.7. High quality wines, on the other hand, seem to be distributed across a narrower pH range, from about 3.15 to 3.45. Further, the lower quality wines (especially 4 and 5) seem to have more pH values below the mean, while higher quality wines (8 and 9) have pH values above the mean. This might suggest that more acidic/sour wines are judged to taste less good.

The correlation values between pH and quality:

## [1] 0.09942725

The correlation between pH value and quality is 0.09942725, which does not imply a strong correlation. While I would argue that some interesting trends may be seen in the above plot, pH value in itself is not a strong predictor of wine quality.

Plot Three

Description Three

I have created a new feature which is the proportion of free sulfur dioxide to total sulfur dioxide. The plot is a scatter plot with the proportion of free sulfur dioxide on the x axis and alcohol content on the y axis.

The plot is faceted by quality level, in order to see if the distribution varies with quality. Further, color is added indicating quality level.

Finally, I have added the median values of both the x and y variables (the medians across all quality levels) as dotted lines, in order to make it easier to compare the distribution for a given quality level with the median value.

From the above plot, it appears that higher quality wines have a higher level of alcohol as well as a higher proportion of free sulfur dioxide, and are therefore more likely to be in the upper right part of the plot. Similarly, lower quality wines tend to have lower alcohol content and a lower proportion of free sulfur dioxide, and are therefore more likely to be in the lower left part of the plot. There is, however, a great deal of variance for each quality level.


Reflection

The white wine data set contains information on 4898 white wine variants of the Portuguese “Vinho Verde” wine. My overall goal with the analysis was to uncover a relation between the different features and wine quality. From my analysis of the different features of the dataset, it appears that there is a connection between some of the features and wine quality. Alcohol level in particular appears to be correlated with higher quality wines.

However, even though there is some relation between the different features and quality, it was not as pronounced as the strong, linear relation between price and carat in the diamonds dataset. I would say it was a bit disappointing not to uncover a stronger relationship. On the other hand, it would be surprising if something as complex as the subjective taste of wine could be broken down to 12 chemical properties. There are likely interactions between the chemical properties that all work together to produce the subjective experience of the wine.

Some of the analysis might be influenced by the fact that there are very few observations at the extreme ends of quality. For example, there are only 20 observations of wines judged to be of quality 3 and only 5 of wines judged to be of the highest quality, 9.

The data set only covers wines from one region in Portugal. It would be interesting to investigate whether the findings would be different if wines from a different region or a range of regions were used. The data also seems to be limited to one year. It would be interesting to see year-on-year changes, particularly as one often hears wine producers talk about “good years” and “bad years”, and to see whether the chemical properties of the wine change from a good year to a bad year.